
feat(voice-server + installer): Google Cloud TTS + cross-platform audio #872

Open
fayerman-source wants to merge 4 commits into danielmiessler:main from fayerman-source:feat/google-cloud-tts-v2

Summary

  • Google Cloud TTS as alternative provider alongside ElevenLabs — configurable via settings.json → daidentity.ttsProvider
  • Cross-platform audio playback — detects afplay (macOS), mpv/ffplay/aplay (Linux/WSL2) at startup
  • Cross-platform desktop notifications: osascript (macOS), notify-send (Linux), with silent fallback
  • Backwards compatible — defaults to ElevenLabs when ttsProvider is not set
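The startup player detection described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `AudioPlayer`, `PLAYER_CANDIDATES`, `pickAudioPlayer`, and the injected `isOnPath` lookup are hypothetical names, and the preference order simply follows the order in which the players are listed in the summary.

```typescript
// Hypothetical names throughout; the PR's real implementation may differ.
type AudioPlayer = {
  command: string;
  // Builds the argument list for playing a single file.
  args: (file: string) => string[];
};

// Candidates in preference order, keyed by Node-style platform string.
// WSL2 reports itself as "linux".
const PLAYER_CANDIDATES: Record<string, AudioPlayer[]> = {
  darwin: [{ command: "afplay", args: (file) => [file] }],
  linux: [
    { command: "mpv", args: (file) => ["--no-video", "--really-quiet", file] },
    { command: "ffplay", args: (file) => ["-nodisp", "-autoexit", "-loglevel", "quiet", file] },
    // aplay does not decode MP3, so it only suits WAV or raw PCM output.
    { command: "aplay", args: (file) => ["-q", file] },
  ],
};

// Returns the first installed candidate, or null when none is available
// (the caller can then skip playback instead of crashing).
function pickAudioPlayer(
  platform: string,
  isOnPath: (command: string) => boolean,
): AudioPlayer | null {
  const candidates = PLAYER_CANDIDATES[platform] ?? [];
  for (const player of candidates) {
    if (isOnPath(player.command)) return player;
  }
  return null;
}
```

Injecting the PATH lookup keeps the selection logic testable without any of the players installed; a server would pass its platform string and a real lookup once at startup and reuse the result.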

Why

  • ElevenLabs free tier is 10K chars/month; Google Cloud free tier is 4M chars/month (Standard) or 1M (WaveNet/Neural2)
  • Voice server used macOS-only afplay and osascript, making it non-functional on Linux/WSL2
  • No new dependencies — uses Google's REST API directly via fetch
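For readers unfamiliar with the REST route, a dependency-free call might look roughly like the sketch below. It targets the public `text:synthesize` endpoint of the Cloud Text-to-Speech v1 API; `GoogleVoice`, `buildSynthesizeRequest`, and `synthesize` are hypothetical names, the voice fields mirror the `googleCloudVoice` settings in this PR, and sending the key in the `X-Goog-Api-Key` header (rather than a `?key=` query parameter) is an assumption about style, not a statement of what the PR does.

```typescript
// Hypothetical type; field names follow this PR's googleCloudVoice settings.
type GoogleVoice = {
  languageCode: string;
  voiceName: string;
  speakingRate: number;
  pitch: number;
};

const SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize";

// Pure helper: builds the URL and fetch options without touching the network.
function buildSynthesizeRequest(text: string, voice: GoogleVoice, apiKey: string) {
  return {
    url: SYNTHESIZE_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "X-Goog-Api-Key": apiKey },
      body: JSON.stringify({
        input: { text },
        voice: { languageCode: voice.languageCode, name: voice.voiceName },
        audioConfig: {
          audioEncoding: "MP3",
          speakingRate: voice.speakingRate,
          pitch: voice.pitch,
        },
      }),
    },
  };
}

// Sends the request and returns the base64-encoded audio from the response.
async function synthesize(text: string, voice: GoogleVoice, apiKey: string): Promise<string> {
  const { url, init } = buildSynthesizeRequest(text, voice, apiKey);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Google Cloud TTS request failed: HTTP ${response.status}`);
  }
  const payload = (await response.json()) as { audioContent?: string };
  if (!payload.audioContent) {
    throw new Error("Google Cloud TTS response contained no audioContent");
  }
  return payload.audioContent;
}
```

The caller would decode the returned base64 string and write it to a temporary file for the audio player.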

Configuration

Add to ~/.env:

GOOGLE_CLOUD_API_KEY=your_key_here
# or reuse existing: GOOGLE_API_KEY=your_key_here (both accepted)

Add to ~/.claude/settings.json:

{
  "daidentity": {
    "ttsProvider": "google-cloud",
    "googleCloudVoice": {
      "languageCode": "en-US",
      "voiceName": "en-US-Neural2-D",
      "voiceType": "NEURAL2",
      "speakingRate": 1.0,
      "pitch": 0.0
    }
  }
}

To keep using ElevenLabs, omit ttsProvider (or set it to "elevenlabs").
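The defaulting behaviour can be illustrated with a small sketch. `resolveTtsProvider` and `resolveGoogleApiKey` are hypothetical helpers; treating unrecognised provider values as ElevenLabs and preferring `GOOGLE_CLOUD_API_KEY` over `GOOGLE_API_KEY` are assumptions, since the description only says that both keys are accepted.

```typescript
type TtsProvider = "elevenlabs" | "google-cloud";

// Only the part of settings.json relevant here.
type Settings = { daidentity?: { ttsProvider?: string } };

// Anything other than an explicit "google-cloud" resolves to ElevenLabs,
// so existing installs without ttsProvider keep their current behaviour.
function resolveTtsProvider(settings: Settings): TtsProvider {
  return settings.daidentity?.ttsProvider === "google-cloud" ? "google-cloud" : "elevenlabs";
}

// Assumed precedence: the dedicated key wins over the shared one.
// Empty strings are treated as unset.
function resolveGoogleApiKey(env: Record<string, string | undefined>): string | undefined {
  return env["GOOGLE_CLOUD_API_KEY"] || env["GOOGLE_API_KEY"] || undefined;
}
```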

Test plan

  • Voice server starts with ttsProvider: "google-cloud" and logs correct provider
  • Health endpoint shows voice_system: "google-cloud", google_cloud_configured: true, audio_player
  • TTS generates and plays audio end-to-end on Linux/WSL2 via mpv
  • Backwards compatible — omitting ttsProvider defaults to ElevenLabs
  • Verify on macOS with afplay
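As an illustration of what the health check in the test plan verifies, the payload could be assembled along these lines. Only the three field names come from this PR; `HealthInput`, `buildHealthPayload`, and the use of `null` when no player was found are assumptions.

```typescript
type HealthInput = {
  provider: "elevenlabs" | "google-cloud";
  googleApiKey?: string;
  audioPlayer: string | null;
};

// Reports which provider is active, whether a Google key was found,
// and which audio player was detected at startup.
function buildHealthPayload(input: HealthInput) {
  return {
    voice_system: input.provider,
    google_cloud_configured: Boolean(input.googleApiKey),
    audio_player: input.audioPlayer,
  };
}
```

The payload exposes only whether a key is configured, never the key itself.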

Context

Re-implementation of #687 (closed during v4.0 restructuring) with additional Linux/WSL2 cross-platform support. Tested live on WSL2 with Google Cloud Neural2-D voice.

🤖 Generated with Claude Code

fayerman-source and others added 2 commits March 2, 2026 14:19
Adds Google Cloud Text-to-Speech as alternative TTS provider and
fixes audio playback on Linux (WSL2).

- Google Cloud TTS via REST API (no SDK), configurable in settings.json
- Cross-platform audio: afplay (macOS), mpv/ffplay/aplay (Linux)
- Cross-platform notifications: osascript (macOS), notify-send (Linux)
- Backwards compatible: defaults to ElevenLabs when ttsProvider unset
- Accepts GOOGLE_CLOUD_API_KEY or GOOGLE_API_KEY from ~/.env

Re-implementation of danielmiessler#687 with additional Linux/WSL2 support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Expands the installer's Step 7 (Voice Setup) to support multiple TTS
providers instead of hardcoding ElevenLabs:

- New provider selection prompt: ElevenLabs / Google Cloud TTS / Skip
- Google Cloud TTS path: key search, validation, Neural2-D default
- ElevenLabs path: unchanged existing flow
- Settings.json gets ttsProvider + googleCloudVoice when Google selected
- .env saves the correct key for chosen provider
- Key validation for Google Cloud via texttospeech.googleapis.com
- Updated types, config-gen, detect, steps descriptions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
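
One plausible way to validate a key against texttospeech.googleapis.com is to list voices, which requires authentication but does not consume synthesis quota. The sketch below is not the installer's code: `checkGoogleCloudKey` and `classifyKeyCheck` are hypothetical names, and the mapping of status codes to outcomes is an assumption about how Google reports invalid keys or a disabled API.

```typescript
type KeyCheck = "valid" | "rejected" | "inconclusive";

// Maps an HTTP status to an installer-friendly outcome.
// Assumption: invalid keys or a disabled API surface as 400, 401, or 403.
function classifyKeyCheck(status: number): KeyCheck {
  if (status >= 200 && status < 300) return "valid";
  if (status === 400 || status === 401 || status === 403) return "rejected";
  return "inconclusive";
}

// Lists en-US voices with the given key and classifies the response.
async function checkGoogleCloudKey(apiKey: string): Promise<KeyCheck> {
  try {
    const response = await fetch(
      "https://texttospeech.googleapis.com/v1/voices?languageCode=en-US",
      { headers: { "X-Goog-Api-Key": apiKey } },
    );
    return classifyKeyCheck(response.status);
  } catch {
    // Network failure: cannot tell whether the key is good or bad.
    return "inconclusive";
  }
}
```

Separating "rejected" from "inconclusive" lets an installer prompt again on a bad key while still accepting a key when the machine is merely offline.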
fayerman-source changed the title from "feat(voice-server): Add Google Cloud TTS + cross-platform audio playback" to "feat(voice-server + installer): Google Cloud TTS + cross-platform audio" on Mar 2, 2026

Ports the Linux service installation from danielmiessler#686 to v4.0.3:

- Platform detection at startup (Darwin/Linux)
- Linux: systemd user service instead of LaunchAgent
- Linux: checks for audio player (mpv/ffplay/aplay) and notify-send
- Detects both ElevenLabs and Google Cloud API keys
- Menu bar indicator prompt only on macOS
- Removed macOS-specific "say" fallback references

Re-implementation of danielmiessler#686 install.sh changes for current architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
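
For context, a systemd user service is defined by a unit file under `~/.config/systemd/user/`. The following sketch renders such a file; `UnitOptions` and `renderUserUnit` are hypothetical names, the PR's install.sh is a shell script that may generate its unit differently, and `Restart=on-failure` is an assumed policy.

```typescript
type UnitOptions = {
  description: string;
  execStart: string; // absolute command line that starts the server
  workingDirectory: string;
};

// Renders a minimal unit file for a per-user service.
function renderUserUnit(options: UnitOptions): string {
  return [
    "[Unit]",
    `Description=${options.description}`,
    "",
    "[Service]",
    `ExecStart=${options.execStart}`,
    `WorkingDirectory=${options.workingDirectory}`,
    "Restart=on-failure",
    "",
    "[Install]",
    // default.target is the usual install target for user services.
    "WantedBy=default.target",
    "",
  ].join("\n");
}
```

After writing the file, `systemctl --user daemon-reload` followed by `systemctl --user enable --now <name>.service` would register and start it, which is the role the LaunchAgent plays on macOS.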
blu3dot pushed a commit to blu3dot/Personal_AI_Infrastructure that referenced this pull request Mar 3, 2026
When ELEVENLABS_API_KEY is missing or the ElevenLabs API call fails,
the VoiceServer now falls back to macOS native `say` command instead
of silently skipping voice output. Pronunciation rules from
pronunciations.json are applied to the fallback too.

- Only triggers when ElevenLabs path didn't play (no double-speak)
- Reuses existing spawnSafe() and applyPronunciations() helpers
- Fails gracefully — logs error, doesn't crash server
- Uses error: unknown with instanceof type guard
- TODO: Linux equivalent (see danielmiessler#855, danielmiessler#872)

Co-Authored-By: Claude <noreply@anthropic.com>

blu3dot pushed a commit to blu3dot/Personal_AI_Infrastructure that referenced this pull request Mar 3, 2026
…ut device

New `voice.requireHeadphones` setting in settings.json (default: false).
When enabled, VoiceServer checks the default audio output device via
`system_profiler SPAudioDataType -json` and skips voice playback if the
output is built-in laptop speakers. Voice plays normally through
Bluetooth, USB, HDMI, AirPlay, and other external audio devices.

- Uses `-json` flag for reliable machine-parseable output
- Caches detection result for 30 seconds (system_profiler takes 140-250ms)
- 3-second timeout prevents hangs if system_profiler stalls
- Fails open — if detection fails, voice plays anyway (convenience, not security)
- Desktop notification banners display regardless of headphone state
- Config uses `=== true` (opt-in, missing key defaults to OFF)
- TODO: Linux equivalent (see danielmiessler#855, danielmiessler#872)

References danielmiessler#855

Co-Authored-By: Claude <noreply@anthropic.com>
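
The caching and fail-open behaviour described in this commit message can be sketched as a small wrapper. This is an illustration only: `makeCachedOutputCheck` is a hypothetical name, the probe is synchronous for simplicity (the real check shells out to `system_profiler`), and caching the fail-open result for the full TTL is an assumption.

```typescript
// Wraps a slow probe ("is voice output allowed right now?") with a TTL cache.
// If the probe throws, the wrapper fails open and reports true.
function makeCachedOutputCheck(
  probe: () => boolean,
  ttlMs: number,
  now: () => number = Date.now,
): () => boolean {
  let cached: { value: boolean; at: number } | null = null;
  return () => {
    const time = now();
    if (cached !== null && time - cached.at < ttlMs) {
      return cached.value; // still fresh, skip the expensive probe
    }
    let value: boolean;
    try {
      value = probe();
    } catch {
      value = true; // fail open: a broken probe must not silence voice
    }
    cached = { value, at: time };
    return value;
  };
}
```

With a 30-second TTL, a burst of notifications pays the 140-250 ms probe cost at most once, and the injectable clock makes the expiry logic testable without waiting.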
Merged origin/main into feat/google-cloud-tts-v2. Single conflict in
actions.ts imports — kept both PAI_VERSION/ALGORITHM_VERSION from main
and validateGoogleCloudKey from this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>